Work with different data formats (binary, netCDF, .mat)

Nikolay Koldunov

koldunovn@gmail.com

This is part of Python for Geosciences notes.

================


In [1]:
%matplotlib inline
import matplotlib.pylab as plt
import numpy as np

Binary data

Open binary


In [ ]:
!wget ftp://sidads.colorado.edu/pub/DATASETS/nsidc0051_gsfc_nasateam_seaice/final-gsfc/north/monthly/nt_200709_f17_v01_n.bin

Create file id:


In [2]:
ice = np.fromfile('nt_200709_f17_v01_n.bin', dtype='uint8')

We use uint8 data type. List of numpy data types

The file format consists of a 300-byte descriptive header followed by a two-dimensional array.


In [3]:
ice = ice[300:]

Reshape


In [4]:
ice = ice.reshape(448,304)

Simple visualisation of array with imshow (Matplotlib function):


In [5]:
plt.imshow(ice)
plt.colorbar();


To convert to the fractional parameter range of 0.0 to 1.0, divide the scaled data in the file by 250.


In [6]:
ice = ice/250.
plt.imshow(ice)
plt.colorbar();


Let's mask all land and missing values:


In [7]:
ice_masked = np.ma.masked_greater(ice, 1.0)
plt.imshow(ice_masked)
plt.colorbar();


Masking in this case is similar to using NaN in Matlab. More about NumPy masked arrays

Save binary


In [8]:
fid = open('My_ice_2007.bin', 'wb')
ice.tofile(fid)
fid.close()

In order to work with other data formats we need to use one of the SciPy submodules:

SciPy

General purpose scientific library (that consist of bunch of sublibraries) and builds on NumPy arrays.

We are going to use only scipy.io library.

scipy.io

Open .mat files

First we have to load function that works with Matlab files:


In [9]:
from scipy.io import loadmat

We are going to download Polar science center Hydrographic Climatology (PHC) for January in Matlab format.


In [ ]:
!wget https://www.dropbox.com/s/0kuzvz03gw6d393/PHC_jan.mat

Open file:


In [10]:
all_variables = loadmat('PHC_jan.mat')

We can look at the names of variables stored in the file:


In [11]:
all_variables.keys()


Out[11]:
['PTEMP1', 'LON', '__header__', '__globals__', 'DEPTH', 'LAT', '__version__']

We need only PTEMP1 (3d potential temperature).


In [12]:
temp = np.array(all_variables['PTEMP1'])

Check variable's shape:


In [13]:
temp.shape


Out[13]:
(33, 180, 360)

Show surface level:


In [14]:
plt.imshow(temp[0,:,:])
plt.colorbar();


Open netCDF files

Scipy have function for working with netCDF files, and you can import it with: from scipy.io import netcdf However it only supports netCDF3 format. It is better to use python netcdf4 module that have a lot of nice functionality. Moreover NCEP reanalysis data, that we are going to work with are in netCDF4 format.

Import nessesary function:


In [15]:
from netCDF4 import Dataset

I am going to download NCEP reanalysis data. Surface 4 daily air temperature for 2012.


In [ ]:
!wget ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis/surface/air.sig995.2012.nc

#Alternative for the times of US goverment shutdowns:
#!wget http://database.rish.kyoto-u.ac.jp/arch/ncep/data/ncep.reanalysis/surface/air.sig995.2012.nc

Create file id:


In [17]:
fnc = Dataset('air.sig995.2012.nc')

It's not really file id, it's netcdf_file object, that have some methods and attributes:


In [18]:
fnc.description


Out[18]:
u'Data is from NMC initialized reanalysis\n(4x/day).  These are the 0.9950 sigma level values.'

In [19]:
fnc.history


Out[19]:
u'created 2011/12 by Hoop (netCDF2.3)\nConverted to chunked, deflated non-packed NetCDF4 2014/09'

list variables


In [20]:
fnc.variables


Out[20]:
OrderedDict([(u'lat', <type 'netCDF4._netCDF4.Variable'>
float32 lat(lat)
    units: degrees_north
    actual_range: [ 90. -90.]
    long_name: Latitude
    standard_name: latitude
    axis: Y
unlimited dimensions: 
current shape = (73,)
filling on, default _FillValue of 9.96920996839e+36 used
), (u'lon', <type 'netCDF4._netCDF4.Variable'>
float32 lon(lon)
    units: degrees_east
    long_name: Longitude
    actual_range: [   0.   357.5]
    standard_name: longitude
    axis: X
unlimited dimensions: 
current shape = (144,)
filling on, default _FillValue of 9.96920996839e+36 used
), (u'time', <type 'netCDF4._netCDF4.Variable'>
float64 time(time)
    long_name: Time
    delta_t: 0000-00-00 06:00:00
    standard_name: time
    axis: T
    units: hours since 1800-01-01 00:00:0.0
    actual_range: [ 1858344.  1867122.]
unlimited dimensions: time
current shape = (1464,)
filling on, default _FillValue of 9.96920996839e+36 used
), (u'air', <type 'netCDF4._netCDF4.Variable'>
float32 air(time, lat, lon)
    long_name: 4xDaily Air temperature at sigma level 995
    units: degK
    precision: 2
    least_significant_digit: 1
    GRIB_id: 11
    GRIB_name: TMP
    var_desc: Air temperature
    dataset: NMC Reanalysis
    level_desc: Surface
    statistic: Individual Obs
    parent_stat: Other
    missing_value: -9.96921e+36
    actual_range: [ 191.1000061  323.       ]
    valid_range: [ 185.16000366  331.16000366]
unlimited dimensions: time
current shape = (1464, 73, 144)
filling on, default _FillValue of 9.96920996839e+36 used
)])

Access information about variables


In [21]:
air = fnc.variables['air']

This time we create netcdf_variable object, that contain among other things attributes of the netCDF variable as well as data themselves.


In [22]:
air.actual_range


Out[22]:
array([ 191.1000061,  323.       ], dtype=float32)

In [23]:
air.long_name


Out[23]:
u'4xDaily Air temperature at sigma level 995'

In [24]:
air.units


Out[24]:
u'degK'

In [25]:
air.shape


Out[25]:
(1464, 73, 144)

We can access the data by simply using array syntax. Here we show first time step of our data set:


In [26]:
plt.imshow(air[0,:,:])
plt.colorbar();


Save netCDF file

Minimalistic variant :)


In [27]:
!rm test_netcdf.nc
fw = Dataset('test_netcdf.nc', 'w')

fw.createDimension('t', 1464)
fw.createDimension('y', 73)
fw.createDimension('x', 144)

air_var = fw.createVariable( 'air','float32', ('t', 'y', 'x'))
air_var[:] = air[:]
fw.close()


rm: test_netcdf.nc: No such file or directory

More descriptive variant:


In [28]:
!rm test_netcdf.nc
fw = Dataset('test_netcdf.nc', 'w')

fw.createDimension('TIME', 1464)
fw.createDimension('LATITUDE', 73)
fw.createDimension('LONGITUDE', 144)

time = fw.createVariable('TIME', 'f', ('TIME',))
time[:] = fnc.variables['time'][:]
time.units = 'hours since 1-1-1 00:00:0.0' 

lat  = fw.createVariable('LATITUDE', 'f', ('LATITUDE',))
lat[:] = fnc.variables['lat'][:]

lon = fw.createVariable('LONGITUDE', 'f', ('LONGITUDE',))
lon[:] = fnc.variables['lon'][:]

ha = fw.createVariable('New_air','f', ('TIME', 'LATITUDE', 'LONGITUDE'))
ha[:] = air[:]
ha.missing_value = -9999.

fw.close()